On Marker-Assisted Prediction of Genetic Value: Beyond the Ridge
http://www.100md.com 《基因杂志》 (Genetics), 2003, Issue 1
    a Department of Animal Sciences, University of Wisconsin, Madison, Wisconsin 53706; b Station d'Amélioration Génétique des Animaux, Institut National de la Recherche Agronomique, 31326 Castanet-Tolosan, France;

    c Departamento de Mejora Genética Animal, Instituto Nacional de Investigaciones Agrarias, 28040-Madrid, Spain

    ABSTRACT

    Marker-assisted genetic improvement of agricultural species exploits statistical dependencies in the joint distribution of marker genotypes and quantitative traits. An issue is how molecular information (e.g., dense marker maps) and phenotypic information (e.g., some measure of yield in plants) are to be used for predicting the genetic value of candidates for selection. Multiple regression, selection index techniques, best linear unbiased prediction, and ridge regression of phenotypes on marker genotypes have been suggested, as well as more elaborate methods. Here, phenotype-marker associations are modeled hierarchically via multilevel models including chromosomal effects, a spatial covariance of marked effects within chromosomes, background genetic variability, and family heterogeneity. Lorenz curves and Gini coefficients are suggested for assessing the inequality of the contribution of different marked effects to genetic variability. Classical and Bayesian methods are presented. The Bayesian approach includes a Markov chain Monte Carlo implementation. The generality and flexibility of the Bayesian method are illustrated when a Lorenz curve is to be inferred.

    The availability of a plethora of markers raises the question of the extent to which molecular information can be used to advantage in genetic improvement programs for agricultural species, such as maize or dairy cattle. There is a large body of literature on this matter (e.g., SOLLER and BECKMANN 1983; SMITH and SIMPSON 1986; LANDE and THOMPSON 1990); WHITTAKER (2001) gives a review. Marker-assisted selection can be effective as long as part of the genetic variance can be associated with segregating marker loci (LANDE and THOMPSON 1990).

    The basic idea in marker-assisted selection is to exploit statistical dependencies (linkage disequilibrium) existing in the joint distribution of marker and quantitative trait loci (QTL) genotypes. For example, when two inbred lines are crossed, the disequilibrium is manifest in the F2 generation. On the other hand, when there is linkage equilibrium at the population level, only the joint distribution of marker and QTL genotypes within a family is nontrivial (OLLIVIER 1998). GEORGES et al. (1995) have exploited high levels of within-family disequilibria in dairy cattle populations for QTL mapping. Most livestock populations have some disequilibrium due at least to chance (small effective size), as shown by FARNIR et al. (2000).

    Linkage disequilibrium between markers and QTL can be used for two main purposes: (1) to infer genomic location and effects of QTL affecting a trait and (2) to arrive at improved (in some statistical sense) predictions of genetic merit of candidates for selection in a breeding program. These two objectives may not be disjoint (e.g., FERNANDO and GROSSMAN 1989). However, genetic cartography of QTL is not a requirement for prediction of genetic merit or marker-assisted selection (LANGE and WHITTAKER 2001). In fact, remarkable advances have been made in prediction of breeding values in livestock since the introduction of the best linear unbiased predictor (BLUP; HENDERSON et al. 1959; HENDERSON 1973). This method uses only phenotypic and pedigree information, with the QTL viewed only in an abstract manner. On the other hand, it has been argued and shown, at least in simulations, that molecular information may enhance the accuracy of selection (e.g., WHITTAKER et al. 2000).

    Our concern is with statistical models and methods for inferring genetic merit using molecular and phenotypic information. The objective is to describe phenotype-marker associations using multilevel hierarchical linear models. The setting is mainly as in WHITTAKER et al. (2000) and LANGE and WHITTAKER (2001), i.e., situations where conditional independence of genetic sampling can be assumed, such as in an F2 population derived from a cross between inbred lines. Model features include chromosome-specific effects, spatial associations of markers within chromosomes, existence of background genetic variability, and heterogeneity among families, if some such clustering exists. Classical and Bayesian methods are described. First, we present a mixed-effects model formulation and a BLUP implementation, with the dispersion components estimated by some likelihood-based procedure. An extension of the mixed-effects model is given subsequently. Finally, a Bayesian formulation is presented, including a Markov chain Monte Carlo procedure for drawing samples from target posterior distributions. Possible applications to outbred populations are discussed.

    A MIXED-EFFECTS MODEL FORMULATION

    Hierarchical representation:

    Let the phenotypic value of individual $i$ for a quantitative trait in an F2 from a cross between inbred lines be described by the model

    $$y_i = \mathbf{x}_i'\boldsymbol{\beta} + a_i + e_i. \qquad (1)$$

    Here, the $p \times 1$ vector $\boldsymbol{\beta}$ contains some systematic effects representing, e.g., year of harvest, level of fertilization, or plant density; $\mathbf{x}_i'$ is a known incidence vector relating $\boldsymbol{\beta}$ to $y_i$; $a_i$ is an unobserved genetic value; and $e_i$ is an independently distributed random residual reflecting environmental variability or inadequacy of the model. It is assumed that the genetic value $a_i$ results from an unknown number ($K$, say) of QTL acting additively, so that

    $$a_i = \sum_{k=1}^{K} Q_{ik}\alpha_k,$$

    where $Q_{ik}$ is the genotype at biallelic (assumed for simplicity) locus $k$ for individual $i$ and $\alpha_k$ is the per-allele effect. If $a_i$ is random, (1) is a special case of the well-known mixed linear model (e.g., HENDERSON 1973). When $K$ goes to infinity this becomes the classical infinitesimal model of quantitative genetics.

    It is conceptually convenient to develop model (1) hierarchically. The first level of a Gaussian hierarchy is given by

    $$y_i \mid \boldsymbol{\beta}, a_i, \sigma_e^2 \sim N(\mathbf{x}_i'\boldsymbol{\beta} + a_i, \sigma_e^2),$$

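    where $N(\cdot)$ indicates a normal distribution and $\sigma_e^2$ is the environmental variance. If the $n$ environmental deviates are independently and identically distributed, this leads to the matrix representation

    $$\mathbf{y} \mid \boldsymbol{\beta}, \mathbf{a}, \sigma_e^2 \sim N(\mathbf{X}\boldsymbol{\beta} + \mathbf{a}, \mathbf{I}\sigma_e^2), \qquad (3)$$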

    where $\mathbf{X}$ is an $n \times p$ incidence matrix assumed (without loss of generality) to have full-column rank, $\mathbf{a} = \{a_i\}$, and $\mathbf{I}$ denotes an identity matrix, in this case $n \times n$. Suppose next that individual $i$ has been typed for marker genotypes at each of $l$ loci; this is represented by the vector

    $$\mathbf{m}_i' = [m_{i1}, m_{i2}, \ldots, m_{il}].$$

    Assume that all individuals have been typed for all markers (this is not realistic, but see the DISCUSSION). Then, the unobserved genetic value can be modeled as

    $$a_i = \mathbf{m}_i'\boldsymbol{\gamma} + \epsilon_i, \qquad (4)$$

    where $\boldsymbol{\gamma} = \{\gamma_k\}$ is of order $l \times 1$. We refer to $\gamma_k$ as the "marked effect" of marker locus $k$ on genetic value, while $\epsilon_i$ can be interpreted as some residual or "background" genetic effect not involved in the association between genetic value and the markers but, yet, having an effect on phenotype. The vector $\boldsymbol{\gamma}$ is the gradient or regression of the unobservable additive genetic value on the observable marker genotype, that is, $\boldsymbol{\gamma} = \partial a_i / \partial \mathbf{m}_i$. As noted by LANDE and THOMPSON (1990), dominance can be introduced by expanding (4) as, e.g.,

    where $\mathbf{m}_i^{2\prime}$ is a row vector with elements consisting of the squares of the corresponding entries of $\mathbf{m}_i$. Interactions between marked effects for different marker loci can be modeled via cross-products between appropriate elements of $\mathbf{m}_i$. For simplicity, additivity of marked effects is assumed throughout.

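    The second level of the hierarchy is represented by a distribution describing the uncertainty about genetic values, given the marked effects, that is, the background genetic variability. We adopt the Gaussian model

    $$a_i \mid \mathbf{m}_i, \boldsymbol{\gamma}, \sigma_\epsilon^2 \sim N(\mathbf{m}_i'\boldsymbol{\gamma}, \sigma_\epsilon^2),$$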

    where $\sigma_\epsilon^2$ is the background additive genetic variance. It is assumed (rightly or wrongly, depending on the context), given the marker genotypes and $\boldsymbol{\gamma}$, that the "background" genetic effects $\epsilon_i$ of different individuals are mutually independent. This implies that either there is no family structure or that, conditionally on the marked effects $\boldsymbol{\gamma}$, the family structure is not relevant. Family clustering is taken up in a later section. In matrix notation, and consistently with (4), the assumption of independence leads to

    $$\mathbf{a} \mid \mathbf{M}, \boldsymbol{\gamma}, \sigma_\epsilon^2 \sim N(\mathbf{M}\boldsymbol{\gamma}, \mathbf{I}\sigma_\epsilon^2), \qquad (5)$$

    where $\mathbf{M}$ is the $n \times l$ matrix of known marker genotypes. Unless there is some prior knowledge about $\sigma_e^2$ and $\sigma_\epsilon^2$ (or some clustering of individuals, such as a family structure), the background effect $\epsilon_i$ must be lumped together with $e_i$, because of nonidentifiability. On the other hand, if the variances $\sigma_\epsilon^2$ and $\sigma_e^2$ are known a priori, it is possible to "predict" $\epsilon_i$ distinctly from $e_i$, in the same way that one can predict additive genetic and environmental effects via BLUP when dispersion parameters are known.
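    The levels of the hierarchy described so far can be made concrete by simulating from them. The following is a minimal sketch, not the authors' code: the function name `simulate_hierarchy`, the -1/0/1 marker coding, the independent (unlinked) markers, the intercept-only $\mathbf{X}$, and all numerical values are illustrative assumptions. It also draws the marked effects from an independent normal distribution, anticipating the third level of the hierarchy.

```python
import numpy as np

def simulate_hierarchy(n=200, l=50, sigma2_gamma=0.05,
                       sigma2_eps=0.5, sigma2_e=1.0, seed=0):
    """Draw one data set from the Gaussian hierarchy (illustrative values)."""
    rng = np.random.default_rng(seed)
    # Assumed marker coding: -1, 0, 1 with F2 frequencies 1/4, 1/2, 1/4.
    # Markers are drawn independently (no linkage), which is a simplification.
    M = rng.choice([-1.0, 0.0, 1.0], size=(n, l), p=[0.25, 0.5, 0.25])
    X = np.ones((n, 1))          # intercept as the only systematic effect
    beta = np.array([10.0])      # arbitrary illustrative value
    # Marked effects: gamma ~ N(0, I * sigma2_gamma).
    gamma = rng.normal(0.0, np.sqrt(sigma2_gamma), size=l)
    # Genetic values: a | M, gamma ~ N(M gamma, I * sigma2_eps).
    eps = rng.normal(0.0, np.sqrt(sigma2_eps), size=n)
    a = M @ gamma + eps
    # Phenotypes: y | beta, a ~ N(X beta + a, I * sigma2_e).
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=n)
    y = X @ beta + a + e
    return {"y": y, "X": X, "M": M, "beta": beta,
            "gamma": gamma, "eps": eps, "a": a, "e": e}
```

    In such a simulation the background effect and the residual enter the phenotype only through their sum, which is the nonidentifiability noted above: without extra structure or known variances, data generated this way cannot separate the two.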

    WHITTAKER et al. (2000) and WHITTAKER (2001) treat $\boldsymbol{\gamma}$ as a fixed parameter and employ ridge regression for estimation of this vector. From a Bayesian perspective (e.g., LINDLEY and SMITH 1972; ZELLNER and VANDAELE 1975), this is equivalent to regarding $\boldsymbol{\gamma}$ as having the distribution

    $$\boldsymbol{\gamma} \mid \sigma_\gamma^2 \sim N(\mathbf{0}, \mathbf{I}\sigma_\gamma^2), \qquad (6)$$

    with $\sigma_\gamma^2$ elicited in some manner. Assumption (6) is adopted as the third level of the hierarchy. In a frequentist setting, the assumption states that the marked effects are drawn at random from the multivariate normal distribution (6) in each conceptual repetition of a crossing experiment. In a Bayesian setting, this would be part of the prior ensemble of the model; this is discussed later. Regression coefficients for markers that are not adjacent to QTL are expected to be null under the assumption of no interference (ZENG 1993). However, without knowing the location of the QTL in relation to the markers (the usual situation), it is not obvious how such a prior consideration can be incorporated into the model. Since (6) is a prior distribution in the Bayesian sense, its influence on inferences can be tempered by measuring enough individuals. At this point, it suffices to say that $\boldsymbol{\gamma}$ can be treated as a random effect merely as a device for obtaining possibly improved (in some sense) predictions of genetic value. HAYES and GODDARD (2001) assumed that unobservable gene or chromosome effects followed a Gamma distribution, so these effects would be strictly positive even though estimated values can be negative. MEUWISSEN et al. (2001) used Gamma deviates for simulating gene effects, but then "tossed a coin" to determine their sign. Therefore, it is unclear what would be gained from assuming a Gamma distribution for the elements of $\boldsymbol{\gamma}$. Note that (6) implies that the "marked effects" are independent and identically distributed, but this assumption is relaxed later on. The normality assumption in (6) is probably adequate and greatly facilitates computation.
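    The stated equivalence between ridge regression and a normal prior on the marked effects can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the phenotypes are taken as already corrected for the systematic effects, and the residual variance `sigma2_resid` is assumed to be the lumped variance of background genetic and environmental effects. With ridge parameter equal to the ratio of residual variance to marked-effect variance, the ridge estimate and the posterior mean coincide.

```python
import numpy as np

def ridge_marked_effects(M, y_centered, lam):
    """Ridge estimate (M'M + lam * I)^{-1} M'y of the marked effects."""
    l = M.shape[1]
    return np.linalg.solve(M.T @ M + lam * np.eye(l), M.T @ y_centered)

def posterior_mean_marked_effects(M, y_centered, sigma2_gamma, sigma2_resid):
    """Posterior mean of gamma under gamma ~ N(0, I * sigma2_gamma) and
    y | gamma ~ N(M gamma, I * sigma2_resid), written in its n x n form."""
    n = M.shape[0]
    V = sigma2_gamma * (M @ M.T) + sigma2_resid * np.eye(n)
    return sigma2_gamma * (M.T @ np.linalg.solve(V, y_centered))
```

    The two functions solve systems of different size ($l \times l$ versus $n \times n$), so in practice one would pick whichever is smaller; with dense marker maps $l$ can exceed $n$, which favors the second form.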

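    The three-stage hierarchy can be condensed by inserting (5) into (3), so that the model describing the phenotypic values can be written as

    $$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{M}\boldsymbol{\gamma} + \boldsymbol{\epsilon} + \mathbf{e}, \qquad (8)$$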

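    where $\boldsymbol{\epsilon} = \{\epsilon_i\}$ and $\mathbf{e} = \{e_i\}$. This can be viewed as a frequentist mixed-effects model, where $\boldsymbol{\beta}$ is a fixed location parameter and $\boldsymbol{\gamma}$ and $\boldsymbol{\epsilon}$ are random terms. Unless additional assumptions are made or some knowledge about the partition of variance in the population is available, $\epsilon_i$ and $e_i$ are "confounded." However, it is conceptually useful to maintain these two vectors as distinct. The marginal (frequentist) distribution of the phenotypes induced by model (8) is the normal process

    $$\mathbf{y} \mid \boldsymbol{\beta}, \sigma_\gamma^2, \sigma_\epsilon^2, \sigma_e^2 \sim N\left(\mathbf{X}\boldsymbol{\beta},\ \mathbf{M}\mathbf{M}'\sigma_\gamma^2 + \mathbf{I}(\sigma_\epsilon^2 + \sigma_e^2)\right), \qquad (9)$$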

    where $\mathbf{M}\mathbf{M}'\sigma_\gamma^2$ is the variance-covariance matrix of marked genotypic values, conditionally on the marker genotypes $\mathbf{M}$ observed in the experiment. In scalar notation, the "total variance" of a single observation is

    $$\mathrm{Var}(y_i \mid \mathbf{m}_i) = \sum_{k=1}^{l} m_{ik}^2\sigma_\gamma^2 + \sigma_\epsilon^2 + \sigma_e^2,$$

    and the ratio

    $$\frac{\sum_{k=1}^{l} m_{ik}^2\sigma_\gamma^2}{\sum_{k=1}^{l} m_{ik}^2\sigma_\gamma^2 + \sigma_\epsilon^2}$$

    is interpretable as the fraction of genetic variance attributable to marked effects, in an experiment repeated over and over with the marker genotypes fixed across replications. The variance "due to" the association with the markers is $\sum_{k=1}^{l} m_{ik}^2\sigma_\gamma^2$, which depends nontrivially on $\mathbf{m}_i$, the specific marker genotype of individual $i$. On the other hand, since the marker genotypes vary at random over replications,

    where the expectation and covariance matrix are taken over the distribution of marker genotypes in the population. Knowledge of the distribution of marker genotypes is needed for evaluation of (11).
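    The conditional variance structure discussed in this section is simple to compute for a given marker matrix. The sketch below is illustrative only, with hypothetical function names. It builds the marginal covariance of the phenotypes given the markers and, for each individual, the share of genetic variance due to marked effects, read here as marked variance divided by marked plus background variance (our interpretation of "fraction of genetic variance attributable to marked effects").

```python
import numpy as np

def marginal_covariance(M, sigma2_gamma, sigma2_eps, sigma2_e):
    """Covariance of y given M: M M' sigma2_gamma + I (sigma2_eps + sigma2_e)."""
    n = M.shape[0]
    return sigma2_gamma * (M @ M.T) + (sigma2_eps + sigma2_e) * np.eye(n)

def marked_fraction(M, sigma2_gamma, sigma2_eps):
    """Per-individual fraction of genetic variance due to marked effects,
    conditional on that individual's marker genotype (row of M)."""
    marked = sigma2_gamma * np.sum(M ** 2, axis=1)
    return marked / (marked + sigma2_eps)
```

    Under a -1/0/1 coding (an assumption, not stated in the text), an individual heterozygous at every marker has a row of zeros and therefore a marked fraction of zero, which illustrates how strongly the conditional quantity depends on the specific genotype.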

    Best prediction and best linear prediction:

    Under standard assumptions, with $\boldsymbol{\beta}$ and the variance components $\sigma_\gamma^2$, $\sigma_\epsilon^2$, and $\sigma_e^2$ known, the joint distribution of $\boldsymbol{\gamma}$ and $\boldsymbol{\epsilon}$, given the phenotypes and the marker genotypes, is the multivariate normal process:

    where $\tau_\gamma = \sigma_e^2/\sigma_\gamma^2$, $\tau_\epsilon = \sigma_e^2/\sigma_\epsilon^2$, and

    The best linear predictor [BLP; also the best predictor (BP) under normality; HENDERSON 1973] of the unobserved total genetic values $\mathbf{a} = \mathbf{M}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$ is

    and the variance-covariance matrix of the prediction error (under normality this is also the covariance matrix of the conditional distribution of $\mathbf{a}$) is

    The best predictor has the smallest possible mean-squared error of prediction. Hence, it would be difficult to improve upon this, provided that the model is reasonable, normality holds, and parameters are known. Thus, (13–16) provide an alternative to a ridge regression approach to prediction. Generalization to multiple traits measured in different individuals is straightforward, but this is not dealt with here. Since, given M, the BLP or BP is unbiased (e.g., HENDERSON 1973), it follows automatically that it is also unbiased unconditionally. However, unconditionally,

    with the expectation taken with respect to the joint distribution of marker genotypes in the entire population. A drawback of BP or BLP is that it is unrealistic to assume that $\boldsymbol{\beta}$, $\sigma_\gamma^2$, $\sigma_\epsilon^2$, and $\sigma_e^2$ are known without error.
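    For readers who want to experiment, best linear prediction with known parameters can be sketched using the standard covariance-based form of the predictor. This is a generic textbook formulation and is not claimed to reproduce the expressions in (13–16); the function name and argument names are hypothetical. With $\mathbf{G} = \mathbf{M}\mathbf{M}'\sigma_\gamma^2 + \mathbf{I}\sigma_\epsilon^2$ and $\mathbf{V} = \mathbf{G} + \mathbf{I}\sigma_e^2$, the predictor of $\mathbf{a}$ is $\mathbf{G}\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ and the prediction error covariance is $\mathbf{G} - \mathbf{G}\mathbf{V}^{-1}\mathbf{G}$.

```python
import numpy as np

def blp_genetic_values(y, X, beta, M, sigma2_gamma, sigma2_eps, sigma2_e):
    """Best linear prediction of a = M gamma + eps with all parameters known.

    Returns (a_hat, gamma_hat, eps_hat, pev), where pev is the covariance
    matrix of the prediction errors a_hat - a.
    """
    n = M.shape[0]
    G = sigma2_gamma * (M @ M.T) + sigma2_eps * np.eye(n)  # Var(a | M)
    V = G + sigma2_e * np.eye(n)                           # Var(y | M)
    Vinv_r = np.linalg.solve(V, y - X @ beta)              # V^{-1}(y - X beta)
    gamma_hat = sigma2_gamma * (M.T @ Vinv_r)              # Cov(gamma, y) V^{-1} r
    eps_hat = sigma2_eps * Vinv_r                          # Cov(eps, y) V^{-1} r
    a_hat = M @ gamma_hat + eps_hat                        # equals G V^{-1} r
    pev = G - G @ np.linalg.solve(V, G)
    return a_hat, gamma_hat, eps_hat, pev
```

    The background term is predicted separately from the residual only because both variances are supplied; if they are not known, only their sum is identifiable and the corresponding effects cannot be separated.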

    Best linear unbiased prediction:

    An obvious improvement is to use BLUP. Unlike BP or BLP above, where $\boldsymbol{\beta}$ is treated as known, BLUP takes into account uncertainty about $\boldsymbol{\beta}$. Under normality, BLUP($\mathbf{a}$) can be interpreted as the mean of the conditional distribution of the predictand $\mathbf{a} = \mathbf{M}\boldsymbol{\gamma} + \boldsymbol{\epsilon}$, given a vector of "error contrasts," denoted as $\mathbf{w}$. For example, take $\mathbf{w} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}}$ is either the ordinary least-squares or the generalized least-squares estimator of $\boldsymbol{\beta}$. In such a setting, BLUP is the best predictor under normality, but only in the class of linear translation invariant predictors (SEARLE 1974; GIANOLA and GOFFINET 1982). It is well known that for (8) and (9),

    (Daniel Gianola, Miguel Perez-Enciso and Miguel A. Toro)